| | Provider | Model | Version | Estimate | Rank |
|---|---|---|---|---|---|
| 1 | anthropic | Claude 3.7 Sonnet | claude-3-7-sonnet-20250219 | 3.8580848 | top |
| 2 | anthropic | Claude 3.5 Sonnet | claude-3-5-sonnet-20241022 | 3.4210271 | top |
| 3 | xai | Grok 3 Beta | grok-3-beta | 3.0488472 | top |
| 4 | anthropic | Claude 3 Haiku | claude-3-haiku-20240307 | 0.3764656 | bottom |
| 5 | cohere | Command R | command-r-08-2024 | 0.3764656 | bottom |
| 6 | openai | GPT-3.5 Turbo | gpt-3.5-turbo | 0.3299676 | bottom |
| 7 | openai | GPT-4o Mini | gpt-4o-mini | 0.2865677 | bottom |
| 8 | | Gemini 2.5 Flash | gemini-2.5-flash | NA | new |
Building on our previous analysis, we selected models based on their performance. We chose 4 top models, whose deliberative reasoning was reliably more consistent than chance, and 4 bottom models, whose deliberative reasoning was reliably less consistent than chance.
| | case | survey | N | topic | subtopic |
|---|---|---|---|---|---|
| 1 | CCPS ACT Deliberative | ccps | 31 | climate | climate |
| 2 | CSIRO WA | energy_futures | 17 | climate | energy |
| 3 | Winterthur | zh_winterthur | 16 | climate | climate |
| | survey | considerations | policies | scale_max | q_method |
|---|---|---|---|---|---|
| 1 | ccps | 33 | 7 | 11 | FALSE |
| 2 | energy_futures | 45 | 9 | 11 | FALSE |
| 3 | zh_winterthur | 30 | 6 | 7 | FALSE |
| | uid | type | article | role | description |
|---|---|---|---|---|---|
| 1 | eco | ideology | an | ecologist | focuses on environmental protection and sustainability, advocating for societal change to ecological limits |
| 2 | coa | perspective | a | coastal resident | endures chronic flooding and salinization, forced to relocate due to rising sea levels and intense storms worsened by climate change |
| 3 | ctr | perspective | a | construction worker | suffers from extreme heat stress and lost work hours, perceiving climate change making outdoor labor unbearable and life-threatening |
| 4 | dis | perspective | a | disease survivor | recovers from dengue fever, aware that climate change’s rising temperatures are expanding the range of disease-carrying mosquitoes in their region |
| 5 | eld | perspective | an | elderly urban resident | endures intensified city heatwaves, struggling with disrupted services and feeling the direct, severe impact of climate change |
| 6 | far | perspective | a | displaced family | loses their home due to unprecedented wildfires, experiencing displacement and recognizing climate change as the major driver of the devastation |
| 7 | fis | perspective | a | fisher | notes his declining catches due to warming oceans, understanding that climate change is reorganizing marine life and reducing their traditional yield |
| 8 | lan | perspective | a | landowner | surveys his parched fields after a prolonged drought, feeling the compounding impacts of climate change that reduce crop yields and family income |
| 9 | par | perspective | a | parent | sees their child fall ill from a water-borne disease, attributing its spread to the increased heavy rainfall and warmer temperatures brought by climate change |
| 10 | sub | perspective | a | subsistence farmer | watches his crops wither under erratic rainfall patterns, and who sees these changes as direct consequence of climate change |
| 11 | vil | perspective | a | villager | faces dwindling, contaminated water supplies due to extended draughts and floods, aware that climate change is altering their water security |
| 12 | csk | devils | a | climate skeptic | prioritizes economic growth over CO2 emission cuts, fossil fuels over renewable energy, and does not believe in climate science |
We collected 1440 responses generated by 8 models across the 3 surveys and 12 roles described above. We prompted each LLM 5 times with the same prompt.
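The response count follows directly from the factorial design. A minimal sketch of that arithmetic, using only the counts stated in the text (the variable names are illustrative, not taken from the analysis code):

```python
from itertools import product

# Design counts as stated in the text; the variable names are illustrative.
N_MODELS, N_SURVEYS, N_ROLES, N_REPEATS = 8, 3, 12, 5

# One request per (model, survey, role, repetition) combination.
requests = list(product(range(N_MODELS), range(N_SURVEYS),
                        range(N_ROLES), range(N_REPEATS)))
total_responses = len(requests)

# One model/survey/role cell groups the N_REPEATS responses to the same prompt.
cells = N_MODELS * N_SURVEYS * N_ROLES

print(total_responses)  # 1440
print(cells)            # 288
```

The 288 model/survey/role cells match the number of observations reported in the mixed-model output further down.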
We instructed LLMs to play each of the roles described above by including a system instruction in each request following the pattern:
Answer the following prompts as [article] [role], who [description].
For example:
Answer the following prompts as a climate skeptic, who prioritizes economic growth over CO2 emission cuts, fossil fuels over renewable energy, and does not believe in climate science.
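The pattern can be filled in mechanically from the `article`, `role`, and `description` columns of the role table. The sketch below illustrates this; the function name `build_system_instruction` is hypothetical and not part of the original pipeline.

```python
def build_system_instruction(article: str, role: str, description: str) -> str:
    """Fill the role-play pattern with one row of the role table."""
    return f"Answer the following prompts as {article} {role}, who {description}."


# Row 12 of the role table (uid "csk").
skeptic_instruction = build_system_instruction(
    article="a",
    role="climate skeptic",
    description=(
        "prioritizes economic growth over CO2 emission cuts, fossil fuels "
        "over renewable energy, and does not believe in climate science"
    ),
)
print(skeptic_instruction)
```

Printing `skeptic_instruction` reproduces the example instruction quoted above.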
We calculated one DRI value per model/survey/role by treating each LLM response as one participant in a deliberation. The role “all” indicates that all roles were part of that deliberation (n = 60 participants, which equals 5 participants for each of the 12 roles). DRI plots are shown in Figure 5.3.
| model | survey | obs_mean | N | mu | p_value_two.sided | sig_two.sided | p_value_greater | sig_greater |
|---|---|---|---|---|---|---|---|---|
| Claude 3.5 Sonnet | ccps | 0.3759073 | 12 | 0 | 0.0009766 | * | 0.0004883 | * |
| Claude 3.5 Sonnet | energy_futures | 0.4695921 | 12 | 0 | 0.0009766 | * | 0.0004883 | * |
| Claude 3.5 Sonnet | zh_winterthur | 0.5683774 | 12 | 0 | 0.0004883 | * | 0.0002441 | * |
| Claude 3.7 Sonnet | ccps | 0.6819898 | 12 | 0 | 0.0004883 | * | 0.0002441 | * |
| Claude 3.7 Sonnet | energy_futures | 0.6173198 | 12 | 0 | 0.0004883 | * | 0.0002441 | * |
| Claude 3.7 Sonnet | zh_winterthur | 0.5911667 | 12 | 0 | 0.0004883 | * | 0.0002441 | * |
| Grok 3 Beta | ccps | 0.3605863 | 12 | 0 | 0.0004883 | * | 0.0002441 | * |
| Grok 3 Beta | energy_futures | 0.7103851 | 12 | 0 | 0.0004883 | * | 0.0002441 | * |
| Grok 3 Beta | zh_winterthur | 0.7314191 | 12 | 0 | 0.0004883 | * | 0.0002441 | * |
| Gemini 2.5 Flash | ccps | 0.8336696 | 12 | 0 | 0.0004883 | * | 0.0002441 | * |
| Gemini 2.5 Flash | energy_futures | 0.5166190 | 12 | 0 | 0.0009766 | * | 0.0004883 | * |
| Gemini 2.5 Flash | zh_winterthur | 0.6778375 | 12 | 0 | 0.0004883 | * | 0.0002441 | * |
| GPT-4o Mini | ccps | 0.0427425 | 12 | 0 | 0.6772461 | n.s. | 0.3386230 | n.s. |
| GPT-4o Mini | energy_futures | -0.0899976 | 12 | 0 | 0.5693359 | n.s. | 0.7407227 | n.s. |
| GPT-4o Mini | zh_winterthur | -0.2190937 | 12 | 0 | 0.0771484 | n.s. | 0.9680176 | n.s. |
| GPT-3.5 Turbo | ccps | -0.2532340 | 12 | 0 | 0.0161133 | * | 0.9938965 | n.s. |
| GPT-3.5 Turbo | energy_futures | -0.2836284 | 12 | 0 | 0.0122070 | * | 0.9953613 | n.s. |
| GPT-3.5 Turbo | zh_winterthur | -0.4205772 | 12 | 0 | 0.0034180 | * | 0.9987793 | n.s. |
| Command R | ccps | -0.4709172 | 12 | 0 | 0.0004883 | * | 1.0000000 | n.s. |
| Command R | energy_futures | -0.0245292 | 12 | 0 | 0.7910156 | n.s. | 0.6333008 | n.s. |
| Command R | zh_winterthur | -0.9582444 | 12 | 0 | 0.0004883 | * | 1.0000000 | n.s. |
| Claude 3 Haiku | ccps | -0.3105968 | 12 | 0 | 0.0004883 | * | 1.0000000 | n.s. |
| Claude 3 Haiku | energy_futures | -0.3584220 | 12 | 0 | 0.0009766 | * | 0.9997559 | n.s. |
| Claude 3 Haiku | zh_winterthur | -0.6380549 | 12 | 0 | 0.0004883 | * | 1.0000000 | n.s. |
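The text does not name the test behind this table. The repeated p-values 0.0004883 and 0.0002441 are, however, consistent with an exact one-sample Wilcoxon signed-rank test against mu = 0 with N = 12, where the smallest attainable p-values are 2/2^12 and 1/2^12. The sketch below checks that arithmetic under this assumption (no ties, no zeros); it is not the original analysis code.

```python
def signed_rank_counts(n):
    """Count how many of the 2**n sign patterns give each positive-rank sum W+."""
    max_sum = n * (n + 1) // 2
    counts = [1] + [0] * max_sum
    for rank in range(1, n + 1):
        # Iterate downwards so that each rank is used at most once.
        for s in range(max_sum, rank - 1, -1):
            counts[s] += counts[s - rank]
    return counts


def exact_p_values(w_plus, n):
    """Exact one-sided (greater) and two-sided p-values, assuming no ties or zeros."""
    counts = signed_rank_counts(n)
    total = 2 ** n
    p_greater = sum(counts[w_plus:]) / total
    p_less = sum(counts[: w_plus + 1]) / total
    p_two_sided = min(1.0, 2 * min(p_greater, p_less))
    return p_greater, p_two_sided


n = 12  # one DRI value per role, as in the N column
# W+ = 78: all 12 values positive. W+ = 77: only the smallest |value| is negative.
p_greater_78, p_two_sided_78 = exact_p_values(78, n)
p_greater_77, p_two_sided_77 = exact_p_values(77, n)
print(p_greater_78, p_two_sided_78)  # 0.000244140625 0.00048828125
print(p_greater_77, p_two_sided_77)  # 0.00048828125 0.0009765625
```

Under this reading, a row with p_value_greater = 0.0002441 means all 12 role-level DRI values were positive, and a row with p_value_greater = 1 means all 12 were negative.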
## Linear mixed model fit by REML ['lmerMod']
## Formula: dri ~ model + (1 | role) + (1 | survey)
## Data: df
##
## REML criterion at convergence: 127.1
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.86198 -0.63430 0.03286 0.59691 3.03838
##
## Random effects:
## Groups Name Variance Std.Dev.
## role (Intercept) 0.002483 0.04983
## survey (Intercept) 0.005538 0.07442
## Residual 0.080233 0.28326
## Number of obs: 288, groups: role, 12; survey, 3
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -0.43569 0.06544 -6.658
## modelClaude 3.5 Sonnet 0.90698 0.06676 13.585
## modelClaude 3.7 Sonnet 1.06585 0.06676 15.964
## modelCommand R -0.04887 0.06676 -0.732
## modelGemini 2.5 Flash 1.11173 0.06676 16.652
## modelGPT-3.5 Turbo 0.11654 0.06676 1.746
## modelGPT-4o Mini 0.34691 0.06676 5.196
## modelGrok 3 Beta 1.03649 0.06676 15.525
##
## Correlation of Fixed Effects:
## (Intr) mC3.5S mC3.7S mdlCmR mG2.5F mGPT-T mGPT-M
## mdlCld3.5Sn -0.510
## mdlCld3.7Sn -0.510 0.500
## modelCmmndR -0.510 0.500 0.500
## mdlGmn2.5Fl -0.510 0.500 0.500 0.500
## mdlGPT-3.5T -0.510 0.500 0.500 0.500 0.500
## modlGPT-4Mn -0.510 0.500 0.500 0.500 0.500 0.500
## modelGrk3Bt -0.510 0.500 0.500 0.500 0.500 0.500 0.500
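One way to read the random-effects block is as a split of the variance left over after the fixed effect of model. The sketch below computes each component's share from the variances printed above; treating these shares as a rough variance partition is an interpretive assumption, not something the output itself reports.

```python
# Variance components copied from the "Random effects" block above.
variance_components = {"role": 0.002483, "survey": 0.005538, "Residual": 0.080233}

total_variance = sum(variance_components.values())
variance_shares = {name: value / total_variance
                   for name, value in variance_components.items()}

for name, share in variance_shares.items():
    print(f"{name}: {share:.1%}")
# role: 2.8%
# survey: 6.3%
# Residual: 90.9%
```

On this reading, role and survey together account for less than a tenth of the variance not explained by model.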
## Linear mixed model fit by REML ['lmerMod']
## Formula: dri ~ model + (1 | survey/role)
## Data: df
##
## REML criterion at convergence: 128.9
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.95607 -0.66969 -0.00041 0.65619 3.06045
##
## Random effects:
## Groups Name Variance Std.Dev.
## role:survey (Intercept) 0.0009013 0.03002
## survey (Intercept) 0.0054477 0.07381
## Residual 0.0817355 0.28589
## Number of obs: 288, groups: role:survey, 36; survey, 3
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -0.43569 0.06412 -6.795
## modelClaude 3.5 Sonnet 0.90698 0.06739 13.460
## modelClaude 3.7 Sonnet 1.06585 0.06739 15.817
## modelCommand R -0.04887 0.06739 -0.725
## modelGemini 2.5 Flash 1.11173 0.06739 16.498
## modelGPT-3.5 Turbo 0.11654 0.06739 1.730
## modelGPT-4o Mini 0.34691 0.06739 5.148
## modelGrok 3 Beta 1.03649 0.06739 15.381
##
## Correlation of Fixed Effects:
## (Intr) mC3.5S mC3.7S mdlCmR mG2.5F mGPT-T mGPT-M
## mdlCld3.5Sn -0.525
## mdlCld3.7Sn -0.525 0.500
## modelCmmndR -0.525 0.500 0.500
## mdlGmn2.5Fl -0.525 0.500 0.500 0.500
## mdlGPT-3.5T -0.525 0.500 0.500 0.500 0.500
## modlGPT-4Mn -0.525 0.500 0.500 0.500 0.500 0.500
## modelGrk3Bt -0.525 0.500 0.500 0.500 0.500 0.500 0.500
## boundary (singular) fit: see help('isSingular')
## refitting model(s) with ML (instead of REML)
## Data: df
## Models:
## m0: dri ~ 1 + (1 | survey/role)
## m1: dri ~ model + (1 | survey/role)
## npar AIC BIC logLik -2*log(L) Chisq Df Pr(>Chisq)
## m0 4 490.84 505.49 -241.420 482.84
## m1 11 118.59 158.89 -48.297 96.59 386.25 7 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
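The columns of the comparison table are linked by standard definitions: AIC = -2 log L + 2k, BIC = -2 log L + k ln(n), and the likelihood-ratio statistic is the difference in -2 log L with degrees of freedom equal to the difference in parameter counts. The sketch below recomputes them from the printed values, assuming n = 288 observations as reported in the model summaries; small discrepancies come from rounding in the printed -2 log L.

```python
import math

n_obs = 288  # "Number of obs" from the model summaries

# npar and -2*log(L) copied from the comparison table above.
fits = {
    "m0": {"npar": 4, "neg2loglik": 482.84},
    "m1": {"npar": 11, "neg2loglik": 96.59},
}


def aic(fit):
    return fit["neg2loglik"] + 2 * fit["npar"]


def bic(fit):
    return fit["neg2loglik"] + fit["npar"] * math.log(n_obs)


aic_m0, aic_m1 = aic(fits["m0"]), aic(fits["m1"])
bic_m0, bic_m1 = bic(fits["m0"]), bic(fits["m1"])

# Likelihood-ratio test of m1 against the intercept-only model m0.
lrt_chisq = fits["m0"]["neg2loglik"] - fits["m1"]["neg2loglik"]
lrt_df = fits["m1"]["npar"] - fits["m0"]["npar"]

print(round(aic_m0, 2), round(aic_m1, 2))
print(round(bic_m0, 2), round(bic_m1, 2))
print(round(lrt_chisq, 2), lrt_df)
```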
## model emmean SE df lower.CL upper.CL
## Gemini 2.5 Flash 0.6760 0.0641 7.44 0.526 0.8259
## Claude 3.7 Sonnet 0.6302 0.0641 7.44 0.480 0.7800
## Grok 3 Beta 0.6008 0.0641 7.44 0.451 0.7506
## Claude 3.5 Sonnet 0.4713 0.0641 7.44 0.321 0.6211
## GPT-4o Mini -0.0888 0.0641 7.44 -0.239 0.0611
## GPT-3.5 Turbo -0.3191 0.0641 7.44 -0.469 -0.1693
## Claude 3 Haiku -0.4357 0.0641 7.44 -0.586 -0.2859
## Command R -0.4846 0.0641 7.44 -0.634 -0.3347
##
## Degrees-of-freedom method: kenward-roger
## Confidence level used: 0.95
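Because model is the only fixed effect, each estimated marginal mean should equal the intercept plus that model's coefficient, with Claude 3 Haiku as the reference level (its mean is the intercept itself). The sketch below checks this against the fixed-effects estimates printed earlier; it is a consistency check, not the original `emmeans` call.

```python
# Fixed-effect estimates copied from the model summary; Claude 3 Haiku is the
# reference level, so its coefficient is 0 by construction.
intercept = -0.43569
coefficients = {
    "Gemini 2.5 Flash": 1.11173,
    "Claude 3.7 Sonnet": 1.06585,
    "Grok 3 Beta": 1.03649,
    "Claude 3.5 Sonnet": 0.90698,
    "GPT-4o Mini": 0.34691,
    "GPT-3.5 Turbo": 0.11654,
    "Claude 3 Haiku": 0.0,
    "Command R": -0.04887,
}

marginal_means = {model: intercept + beta for model, beta in coefficients.items()}

for model, mean in marginal_means.items():
    print(f"{model}: {mean:.4f}")
```

The four top models land between roughly 0.47 and 0.68, while the four bottom models all fall below zero, matching the `emmean` column.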
## # A tibble: 12 × 3
## role mean_dri sd_dri
## <chr> <dbl> <dbl>
## 1 coa 0.125 0.547
## 2 csk 0.287 0.550
## 3 ctr 0.189 0.457
## 4 dis 0.0416 0.564
## 5 eco 0.141 0.638
## 6 eld 0.149 0.531
## 7 far 0.0617 0.612
## 8 fis 0.0519 0.604
## 9 lan 0.170 0.506
## 10 par 0.111 0.608
## 11 sub 0.210 0.541
## 12 vil 0.0379 0.616
## # A tibble: 12 × 4
## role mean_role_noise max_role_noise min_role_noise
## <chr> <dbl> <dbl> <dbl>
## 1 coa 0.246 0.549 0.116
## 2 csk 0.187 0.370 0.00776
## 3 ctr 0.299 0.402 0.106
## 4 dis 0.217 0.369 0.00799
## 5 eco 0.233 0.517 0.0277
## 6 eld 0.245 0.724 0.0452
## 7 far 0.221 0.373 0.0647
## 8 fis 0.192 0.566 0.0365
## 9 lan 0.251 0.442 0.121
## 10 par 0.304 0.559 0.0512
## 11 sub 0.349 0.685 0.128
## 12 vil 0.301 0.571 0.0186
##
## Fligner-Killeen test of homogeneity of variances
##
## data: sd_rep by role
## Fligner-Killeen:med chi-squared = 8.0891, df = 11, p-value = 0.7053
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 7 1.8873 0.08108 .
## 88
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We compared top and bottom models in terms of the consistency of DRI and Cronbach’s alpha (see top models in Figure 5.1 and bottom models in Figure 5.2).
Figure 5.1: Top models
We found that top LLMs are consistent across roles in terms of both DRI and Cronbach’s alpha (policies). The high DRI across roles (median = 0.637; IQR = 0.161) suggests that these LLMs tend to consistently align their considerations and policy preferences. The high Cronbach’s alpha for their policy preferences (median = 0.784; IQR = 0.047) suggests that they tend to agree on the ranking of their policy preferences.
Figure 5.2: Bottom models
We also found that bottom LLMs are not consistent across roles in terms of DRI and are less consistent than top models in terms of Cronbach’s alpha (policies). The low DRI across roles (median = -0.177; IQR = 0.163) suggests that these LLMs tend to consistently misalign their considerations and policy preferences. Their Cronbach’s alpha for policy preferences (median = 0.635; IQR = 0.11) is lower than that of top models, which suggests that they agree less on the ranking of their policy preferences.
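For readers unfamiliar with the statistic, Cronbach's alpha for k items is k/(k-1) multiplied by (1 minus the sum of item variances divided by the variance of the summed scores). The sketch below implements that textbook formula. Treating each participant as an "item" and each policy as a "case" is an assumption made for illustration, and the rankings are invented toy data, not survey responses.

```python
from statistics import variance


def cronbach_alpha(items):
    """Textbook Cronbach's alpha; `items` is a list of equally long score lists."""
    k = len(items)
    # Total score per case, summed across items.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Toy data (hypothetical): three participants ranking four policies.
identical_rankings = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
similar_rankings = [[1, 2, 3, 4], [1, 2, 4, 3], [2, 1, 3, 4]]

alpha_identical = cronbach_alpha(identical_rankings)
alpha_similar = cronbach_alpha(similar_rankings)
print(alpha_identical, alpha_similar)
```

Identical rankings give an alpha of 1, and alpha falls as the rankings diverge, which is the sense in which a higher alpha indicates more agreement on the policy ranking.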
| | role | claude-3-5-sonnet-20241022 | claude-3-7-sonnet-20250219 | claude-3-haiku-20240307 | command-r-08-2024 | gemini-2.5-flash | gpt-3.5-turbo | gpt-4o-mini | grok-3-beta | best_model |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | all | 0.512 | 0.639 | -0.291 | -0.281 | 0.638 | -0.213 | 0.000 | 0.625 | claude-3-7-sonnet-20250219 |
| 2 | coa | 0.350 | 0.565 | -0.526 | -0.435 | 0.810 | -0.315 | -0.019 | 0.567 | gemini-2.5-flash |
| 3 | csk | 0.543 | 0.773 | -0.118 | -0.580 | 0.875 | 0.163 | -0.153 | 0.795 | gemini-2.5-flash |
| 4 | ctr | 0.343 | 0.567 | -0.368 | -0.264 | 0.663 | -0.129 | 0.252 | 0.447 | gemini-2.5-flash |
| 5 | dis | 0.476 | 0.538 | -0.553 | -0.490 | 0.569 | -0.719 | 0.057 | 0.455 | gemini-2.5-flash |
| 6 | eco | 0.364 | 0.720 | -0.281 | -0.831 | 0.854 | -0.472 | 0.084 | 0.696 | gemini-2.5-flash |
| 7 | eld | 0.404 | 0.498 | -0.335 | -0.396 | 0.796 | -0.078 | -0.322 | 0.626 | gemini-2.5-flash |
| 8 | far | 0.479 | 0.651 | -0.524 | -0.673 | 0.821 | -0.388 | -0.370 | 0.497 | gemini-2.5-flash |
| 9 | fis | 0.497 | 0.593 | -0.492 | -0.560 | 0.685 | -0.665 | -0.244 | 0.602 | gemini-2.5-flash |
| 10 | lan | 0.595 | 0.633 | -0.318 | -0.347 | 0.477 | -0.466 | 0.199 | 0.587 | claude-3-7-sonnet-20250219 |
| 11 | par | 0.498 | 0.708 | -0.669 | -0.472 | 0.598 | -0.164 | -0.284 | 0.670 | claude-3-7-sonnet-20250219 |
| 12 | sub | 0.526 | 0.712 | -0.433 | -0.218 | 0.556 | -0.106 | -0.014 | 0.654 | claude-3-7-sonnet-20250219 |
| 13 | vil | 0.581 | 0.604 | -0.612 | -0.550 | 0.407 | -0.490 | -0.252 | 0.613 | grok-3-beta |
| | role | claude-3-5-sonnet-20241022 | claude-3-7-sonnet-20250219 | claude-3-haiku-20240307 | command-r-08-2024 | gemini-2.5-flash | gpt-3.5-turbo | gpt-4o-mini | grok-3-beta | best_model |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | all | 0.725 | 0.792 | 0.614 | 0.638 | 0.801 | 0.599 | 0.641 | 0.818 | grok-3-beta |
| 2 | coa | 0.713 | 0.745 | 0.816 | 0.808 | 0.771 | 0.737 | 0.763 | 0.807 | claude-3-haiku-20240307 |
| 3 | csk | 0.783 | 0.802 | 0.813 | 0.708 | 0.848 | 0.764 | 0.715 | 0.851 | grok-3-beta |
| 4 | ctr | 0.749 | 0.791 | 0.774 | 0.776 | 0.918 | 0.787 | 0.727 | 0.755 | gemini-2.5-flash |
| 5 | dis | 0.761 | 0.772 | 0.669 | 0.802 | 0.771 | 0.762 | 0.756 | 0.796 | command-r-08-2024 |
| 6 | eco | 0.764 | 0.844 | 0.711 | 0.730 | 0.814 | 0.800 | 0.759 | 0.716 | claude-3-7-sonnet-20250219 |
| 7 | eld | 0.722 | 0.793 | 0.788 | 0.740 | 0.741 | 0.801 | 0.813 | 0.828 | grok-3-beta |
| 8 | far | 0.726 | 0.807 | 0.791 | 0.843 | 0.827 | 0.769 | 0.828 | 0.824 | command-r-08-2024 |
| 9 | fis | 0.787 | 0.792 | 0.690 | 0.793 | 0.829 | 0.750 | 0.825 | 0.704 | gemini-2.5-flash |
| 10 | lan | 0.715 | 0.792 | 0.802 | 0.805 | 0.789 | 0.783 | 0.795 | 0.792 | command-r-08-2024 |
| 11 | par | 0.785 | 0.704 | 0.774 | 0.777 | 0.790 | 0.778 | 0.762 | 0.833 | grok-3-beta |
| 12 | sub | 0.841 | 0.800 | 0.671 | 0.754 | 0.761 | 0.760 | 0.803 | 0.839 | claude-3-5-sonnet-20241022 |
| 13 | vil | 0.708 | 0.818 | 0.770 | 0.794 | 0.808 | 0.786 | 0.798 | 0.662 | claude-3-7-sonnet-20250219 |
| | role | claude-3-5-sonnet-20241022 | claude-3-7-sonnet-20250219 | claude-3-haiku-20240307 | command-r-08-2024 | gemini-2.5-flash | gpt-3.5-turbo | gpt-4o-mini | grok-3-beta | best_model |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | all | 0.990 | 0.990 | 0.976 | 0.975 | 0.984 | 0.911 | 0.976 | 0.987 | claude-3-5-sonnet-20241022 |
| 2 | coa | 0.863 | 0.918 | 0.880 | 0.787 | 0.849 | 0.886 | 0.837 | 0.891 | claude-3-7-sonnet-20250219 |
| 3 | csk | 0.769 | 0.856 | 0.898 | 0.767 | 0.551 | 0.952 | 0.817 | 0.831 | gpt-3.5-turbo |
| 4 | ctr | 0.916 | 0.909 | 0.872 | 0.915 | 0.852 | 0.916 | 0.852 | 0.906 | claude-3-5-sonnet-20241022 |
| 5 | dis | 0.905 | 0.921 | 0.894 | 0.904 | 0.859 | 0.918 | 0.876 | 0.896 | claude-3-7-sonnet-20250219 |
| 6 | eco | 0.900 | 0.860 | 0.884 | 0.827 | 0.842 | 0.865 | 0.871 | 0.863 | claude-3-5-sonnet-20241022 |
| 7 | eld | 0.917 | 0.899 | 0.919 | 0.886 | 0.917 | 0.911 | 0.879 | 0.903 | claude-3-haiku-20240307 |
| 8 | far | 0.905 | 0.848 | 0.919 | 0.747 | 0.815 | 0.774 | 0.860 | 0.905 | claude-3-haiku-20240307 |
| 9 | fis | 0.916 | 0.895 | 0.894 | 0.907 | 0.896 | 0.918 | 0.891 | 0.905 | gpt-3.5-turbo |
| 10 | lan | 0.917 | 0.914 | 0.884 | 0.904 | 0.884 | 0.885 | 0.909 | 0.917 | claude-3-5-sonnet-20241022 |
| 11 | par | 0.925 | 0.905 | 0.863 | 0.867 | 0.830 | 0.888 | 0.885 | 0.922 | claude-3-5-sonnet-20241022 |
| 12 | sub | 0.902 | 0.919 | 0.895 | 0.758 | 0.851 | 0.889 | 0.906 | 0.911 | claude-3-7-sonnet-20250219 |
| 13 | vil | 0.881 | 0.880 | 0.914 | 0.901 | 0.873 | 0.927 | 0.895 | 0.887 | gpt-3.5-turbo |
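In all three tables, the `best_model` column appears to be the row-wise maximum, with ties going to the leftmost column (for example, row `lan` of the last table, where two models share 0.917). The sketch below reproduces that rule for two rows; the helper name `best_model` and the tie-breaking rule are inferred from the tables, not confirmed by the source.

```python
def best_model(scores):
    """Return the model with the highest score; ties go to the first (leftmost) entry."""
    # max() returns the first key that attains the maximum, and dicts keep insertion order.
    return max(scores, key=scores.get)


# Row "all" of the first table.
row_all_first_table = {
    "claude-3-5-sonnet-20241022": 0.512, "claude-3-7-sonnet-20250219": 0.639,
    "claude-3-haiku-20240307": -0.291, "command-r-08-2024": -0.281,
    "gemini-2.5-flash": 0.638, "gpt-3.5-turbo": -0.213,
    "gpt-4o-mini": 0.000, "grok-3-beta": 0.625,
}

# Row "lan" of the last table, where two models tie at 0.917.
row_lan_last_table = {
    "claude-3-5-sonnet-20241022": 0.917, "claude-3-7-sonnet-20250219": 0.914,
    "claude-3-haiku-20240307": 0.884, "command-r-08-2024": 0.904,
    "gemini-2.5-flash": 0.884, "gpt-3.5-turbo": 0.885,
    "gpt-4o-mini": 0.909, "grok-3-beta": 0.917,
}

best_all = best_model(row_all_first_table)
best_lan = best_model(row_lan_last_table)
print(best_all)  # claude-3-7-sonnet-20250219
print(best_lan)  # claude-3-5-sonnet-20241022
```

Several of these margins are small (0.639 versus 0.638 in row `all`), so the `best_model` label alone should not be read as a meaningful difference between models.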
These plots show a simulated deliberation across all 12 roles for each survey and model. Each simulated deliberation has 60 participants (12 roles with 5 participants each).
Note that bottom models are visually inconsistent.
Figure 5.3: DRI Plots
These plots show a simulated deliberation across all models in the same class (i.e., top, bottom) for each role and survey. Each simulated deliberation has 20 participants (4 models with 5 participants each).
Note that top models are visually more consistent than bottom models.